Customer Segmentation and Market Basket Analysis#

Introduction: In this notebook we use the publicly available Online Retail dataset to explore customer segmentation through unsupervised learning. We then go further and apply association rule mining to find interesting rules and patterns in the transaction database. These customer segments, rules, and patterns can be used to make useful, customer-focused business decisions.

Online Retail is a transnational dataset containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of the company's customers are wholesalers.

But first, let’s get a deeper understanding of both concepts:

  • Customer segmentation: Customer segmentation is the problem of uncovering information about a firm’s customer base, based on their interactions with the business. In most cases this interaction is in terms of their purchase behavior and patterns. We explore some of the ways in which this can be used.

  • Market basket analysis: Market basket analysis is a method to gain insight into the granular behavior of customers. It helps in devising strategies that uncover a deeper understanding of the purchase decisions made by customers. This is interesting because often even the customers themselves are unaware of such biases or trends in their purchasing behavior.
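To make these ideas concrete before we reach the mining step, the core association-rule measures (support, confidence, and lift) can be computed by hand on a toy basket list; the item names below are invented for illustration and are not from the Online Retail data:

```python
# Toy transaction database: each basket is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

# Rule {bread} -> {milk}
sup_both = support({"bread", "milk"})        # joint support: 3/5
confidence = sup_both / support({"bread"})   # P(milk | bread)
lift = confidence / support({"milk"})        # >1 means positive association

print(f"support={sup_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.60 confidence=0.75 lift=0.94
```

A lift below 1, as in this toy example, means the two items co-occur slightly less often than independence would predict; the FP-growth mining later in the notebook searches for rules where these measures are high.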

Table of Contents

Load Dependencies and Configuration Settings#

We start by installing the orange3 package from the command line, since it is not possible to include it through the usual procedure of adding custom packages in the kernel.

pip install orange3
pip install pandasql
import os
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
def ignore_warn(*args, **kwargs):
    pass

warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

import pandas as pd
import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.cm as cm

%matplotlib inline

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

import seaborn as sns
sns.set(style="ticks", color_codes=True, font_scale=1.5)
color = sns.color_palette()
sns.set_style('darkgrid')

from mpl_toolkits.mplot3d import Axes3D

import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode()

from scipy import stats
from scipy.stats import skew, norm, probplot, boxcox
from sklearn import preprocessing

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import Orange
from Orange.data import Domain, DiscreteVariable, ContinuousVariable
from orangecontrib.associate.fpgrowth import *
cs_df = pd.read_excel(io=r'/workspaces/tugas/OnlineRetail.xlsx')

Exploratory Data Analysis (EDA)#


Take a First Look of our Data:#

I created the function below to simplify the analysis of the general characteristics of the data. Inspired by R’s str function, it returns the types, counts, distinct counts, null counts, missing ratio and unique values of each field/feature.

If the study involves some supervised learning, this function can also return the correlation with the dependent variable; for this we just need to provide the dependent variable in the pred parameter.

Also, if its return value is stored in a variable, you can evaluate it in more detail, focus on a specific field, or sort it from different perspectives.

import pandas as pd

def rstr(df, pred=None): 
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: x.nunique())
    nulls = df.isnull().sum()
    distincts = df.apply(lambda x: x.unique().shape[0])
    missing_ratio = (df.isnull().sum() / obs) * 100
    
    # Calculate skewness and kurtosis only for numeric columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    skewness = df[numeric_cols].skew()
    kurtosis = df[numeric_cols].kurt() 
    
    print('Data shape:', df.shape)
    
    if pred is None:
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis']
        str_summary = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis], axis=1, sort=True)
    else:
        corr = df.corr()[pred]
        str_summary = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis, corr], axis=1, sort=True)
        corr_col = 'corr ' + pred
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis', corr_col]
    
    str_summary.columns = cols
    dtypes = str_summary.types.value_counts()
    print('_________\nData types:\n', dtypes)
    print('_________')
    return str_summary

details = rstr(cs_df)
display(details.sort_values(by='missing ratio', ascending=False))
Data shape: (541909, 8)
_________
Data types:
 types
object            4
float64           2
datetime64[ns]    1
int64             1
Name: count, dtype: int64
_________
types counts distincts nulls missing ratio uniques skewness kurtosis
CustomerID float64 406829 4373 135080 24.926694 4372 0.029835 -1.179982
Description object 540455 4224 1454 0.268311 4223 NaN NaN
Country object 541909 38 0 0.000000 38 NaN NaN
InvoiceDate datetime64[ns] 541909 23260 0 0.000000 23260 NaN NaN
InvoiceNo object 541909 25900 0 0.000000 25900 NaN NaN
Quantity int64 541909 722 0 0.000000 722 -0.264076 119769.160031
StockCode object 541909 4070 0 0.000000 4070 NaN NaN
UnitPrice float64 541909 1630 0 0.000000 1630 186.506972 59005.719097

Let’s see the description of each column:

  • InvoiceNo: A unique identifier for the invoice. An invoice number shared across rows means those rows were part of a single invoice (multiple items purchased together).

  • StockCode: Identifier for the items contained in an invoice.

  • Description: Textual description of each stock item.

  • Quantity: The quantity of the item purchased.

  • InvoiceDate: Date of purchase.

  • UnitPrice: Value of each item.

  • CustomerID: Identifier for the customer making the purchase.

  • Country: Country of the customer.

cs_df.describe()
Quantity InvoiceDate UnitPrice CustomerID
count 541909.000000 541909 541909.000000 406829.000000
mean 9.552250 2011-07-04 13:34:57.156386048 4.611114 15287.690570
min -80995.000000 2010-12-01 08:26:00 -11062.060000 12346.000000
25% 1.000000 2011-03-28 11:34:00 1.250000 13953.000000
50% 3.000000 2011-07-19 17:17:00 2.080000 15152.000000
75% 10.000000 2011-10-19 11:27:00 4.130000 16791.000000
max 80995.000000 2011-12-09 12:50:00 38970.000000 18287.000000
std 218.081158 NaN 96.759853 1713.600303

We can observe from the preceding output that Quantity and UnitPrice have negative values, which suggests that the data also contains some return transactions. As our goal is customer segmentation and market basket analysis, these records should be removed, but first let’s check whether there are records where both values are negative, or where one is negative and the other is zero.

print('Check if we had negative quantity and prices at same register:',
     'No' if cs_df[(cs_df.Quantity<0) & (cs_df.UnitPrice<0)].shape[0] == 0 else 'Yes', '\n')
print('Check how many register we have where quantity is negative',
      'and prices is 0 or vice-versa:',
      cs_df[(cs_df.Quantity<=0) & (cs_df.UnitPrice<=0)].shape[0])
print('\nWhat is the customer ID of the registers above:',
      cs_df.loc[(cs_df.Quantity<=0) & (cs_df.UnitPrice<=0), 
                ['CustomerID']].CustomerID.unique())
print('\n% Negative Quantity: {:3.2%}'.format(cs_df[(cs_df.Quantity<0)].shape[0]/cs_df.shape[0]))
print('\nAll register with negative quantity has Invoice start with:', 
      cs_df.loc[(cs_df.Quantity<0) & ~(cs_df.CustomerID.isnull()), 'InvoiceNo'].apply(lambda x: x[0]).unique())
print('\nSee an example of negative quantity and others related records:')
display(cs_df[(cs_df.CustomerID==12472) & (cs_df.StockCode==22244)])
Check if we had negative quantity and prices at same register: No 

Check how many register we have where quantity is negative and prices is 0 or vice-versa: 1336

What is the customer ID of the registers above: [nan]

% Negative Quantity: 1.96%

All register with negative quantity has Invoice start with: ['C']

See an example of negative quantity and others related records:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
1973 C536548 22244 3 HOOK HANGER MAGIC GARDEN -4 2010-12-01 14:33:00 1.95 12472.0 Germany
9438 537201 22244 3 HOOK HANGER MAGIC GARDEN 12 2010-12-05 14:19:00 1.95 12472.0 Germany
121980 546843 22244 3 HOOK HANGER MAGIC GARDEN 12 2011-03-17 12:40:00 1.95 12472.0 Germany
print('Check register with UnitPrice negative:')
display(cs_df[(cs_df.UnitPrice<0)])
print("Sales records with Customer ID and zero in Unit Price:",cs_df[(cs_df.UnitPrice==0)  & ~(cs_df.CustomerID.isnull())].shape[0])
cs_df[(cs_df.UnitPrice==0)  & ~(cs_df.CustomerID.isnull())]
Check register with UnitPrice negative:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom
Sales records with Customer ID and zero in Unit Price: 40
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
9302 537197 22841 ROUND CAKE TIN VINTAGE GREEN 1 2010-12-05 14:02:00 0.0 12647.0 Germany
33576 539263 22580 ADVENT CALENDAR GINGHAM SACK 4 2010-12-16 14:36:00 0.0 16560.0 United Kingdom
40089 539722 22423 REGENCY CAKESTAND 3 TIER 10 2010-12-21 13:45:00 0.0 14911.0 EIRE
47068 540372 22090 PAPER BUNTING RETROSPOT 24 2011-01-06 16:41:00 0.0 13081.0 United Kingdom
47070 540372 22553 PLASTERS IN TIN SKULLS 24 2011-01-06 16:41:00 0.0 13081.0 United Kingdom
56674 541109 22168 ORGANISER WOOD ANTIQUE WHITE 1 2011-01-13 15:10:00 0.0 15107.0 United Kingdom
86789 543599 84535B FAIRY CAKES NOTEBOOK A6 SIZE 16 2011-02-10 13:08:00 0.0 17560.0 United Kingdom
130188 547417 22062 CERAMIC BOWL WITH LOVE HEART DESIGN 36 2011-03-23 10:25:00 0.0 13239.0 United Kingdom
139453 548318 22055 MINI CAKE STAND HANGING STRAWBERY 5 2011-03-30 12:45:00 0.0 13113.0 United Kingdom
145208 548871 22162 HEART GARLAND RUSTIC PADDED 2 2011-04-04 14:42:00 0.0 14410.0 United Kingdom
157042 550188 22636 CHILDS BREAKFAST SET CIRCUS PARADE 1 2011-04-14 18:57:00 0.0 12457.0 Switzerland
187613 553000 47566 PARTY BUNTING 4 2011-05-12 15:21:00 0.0 17667.0 United Kingdom
198383 554037 22619 SET OF 6 SOLDIER SKITTLES 80 2011-05-20 14:13:00 0.0 12415.0 Australia
279324 561284 22167 OVAL WALL MIRROR DIAMANTE 1 2011-07-26 12:24:00 0.0 16818.0 United Kingdom
282912 561669 22960 JAM MAKING SET WITH JARS 11 2011-07-28 17:09:00 0.0 12507.0 Spain
285657 561916 M Manual 1 2011-08-01 11:44:00 0.0 15581.0 United Kingdom
298054 562973 23157 SET OF 6 NATIVITY MAGNETS 240 2011-08-11 11:42:00 0.0 14911.0 EIRE
314745 564651 23270 SET OF 2 CERAMIC PAINTED HEARTS 96 2011-08-26 14:19:00 0.0 14646.0 Netherlands
314746 564651 23268 SET OF 2 CERAMIC CHRISTMAS REINDEER 192 2011-08-26 14:19:00 0.0 14646.0 Netherlands
314747 564651 22955 36 FOIL STAR CAKE CASES 144 2011-08-26 14:19:00 0.0 14646.0 Netherlands
314748 564651 21786 POLKADOT RAIN HAT 144 2011-08-26 14:19:00 0.0 14646.0 Netherlands
358655 568158 PADS PADS TO MATCH ALL CUSHIONS 1 2011-09-25 12:22:00 0.0 16133.0 United Kingdom
361825 568384 M Manual 1 2011-09-27 09:46:00 0.0 12748.0 United Kingdom
379913 569716 22778 GLASS CLOCHE SMALL 2 2011-10-06 08:17:00 0.0 15804.0 United Kingdom
395529 571035 M Manual 1 2011-10-13 12:50:00 0.0 12446.0 RSA
420404 572893 21208 PASTEL COLOUR HONEYCOMB FAN 5 2011-10-26 14:36:00 0.0 18059.0 United Kingdom
436428 574138 23234 BISCUIT TIN VINTAGE CHRISTMAS 216 2011-11-03 11:26:00 0.0 12415.0 Australia
436597 574175 22065 CHRISTMAS PUDDING TRINKET POT 12 2011-11-03 11:47:00 0.0 14110.0 United Kingdom
436961 574252 M Manual 1 2011-11-03 13:24:00 0.0 12437.0 France
439361 574469 22385 JUMBO BAG SPACEBOY DESIGN 12 2011-11-04 11:55:00 0.0 12431.0 Australia
446125 574879 22625 RED KITCHEN SCALES 2 2011-11-07 13:22:00 0.0 13014.0 United Kingdom
446793 574920 22899 CHILDREN'S APRON DOLLY GIRL 1 2011-11-07 16:34:00 0.0 13985.0 United Kingdom
446794 574920 23480 MINI LIGHTS WOODLAND MUSHROOMS 1 2011-11-07 16:34:00 0.0 13985.0 United Kingdom
454463 575579 22437 SET OF 9 BLACK SKULL BALLOONS 20 2011-11-10 11:49:00 0.0 13081.0 United Kingdom
454464 575579 22089 PAPER BUNTING VINTAGE PAISLEY 24 2011-11-10 11:49:00 0.0 13081.0 United Kingdom
479079 577129 22464 HANGING METAL HEART LANTERN 4 2011-11-17 19:52:00 0.0 15602.0 United Kingdom
479546 577168 M Manual 1 2011-11-18 10:42:00 0.0 12603.0 Germany
480649 577314 23407 SET OF 2 TRAYS HOME SWEET HOME 2 2011-11-18 13:23:00 0.0 12444.0 Norway
485985 577696 M Manual 1 2011-11-21 11:57:00 0.0 16406.0 United Kingdom
502122 578841 84826 ASSTD DESIGN 3D PAPER STICKERS 12540 2011-11-25 15:57:00 0.0 13256.0 United Kingdom

As you can see, there are no records where both quantity and price are negative, but there are 1,336 records where one of them is negative and the other is 0. However, note that for all these records we do not have the customer ID, so we conclude that we can safely drop all records where the quantity or the price is negative or zero. In addition, from the earlier summary we see that there are 135,080 records without customer identification, which we may also disregard.

# Remove rows with missing CustomerID
cs_df = cs_df.dropna(subset=['CustomerID'])

# Remove returns (negative quantity) and non-positive unit prices
cs_df = cs_df[(cs_df['Quantity'] > 0) & (cs_df['UnitPrice'] > 0)]

details = rstr(cs_df)
display(details.sort_values(by='distincts', ascending=False))
Data shape: (397884, 8)
_________
Data types:
 types
object            4
float64           2
datetime64[ns]    1
int64             1
Name: count, dtype: int64
_________
types counts distincts nulls missing ratio uniques skewness kurtosis
InvoiceNo object 397884 18532 0 0.0 18532 NaN NaN
InvoiceDate datetime64[ns] 397884 17282 0 0.0 17282 NaN NaN
CustomerID float64 397884 4338 0 0.0 4338 0.025729 -1.180822
Description object 397884 3877 0 0.0 3877 NaN NaN
StockCode object 397884 3665 0 0.0 3665 NaN NaN
UnitPrice float64 397884 440 0 0.0 440 204.032727 58140.396673
Quantity int64 397884 301 0 0.0 301 409.892972 178186.243253
Country object 397884 37 0 0.0 37 NaN NaN

After this first cleanup, note that we still have more descriptions than stock codes, so there is some inconsistency in the data that requires further investigation. Let’s see it:

cat_des_df = cs_df.groupby(["StockCode", "Description"]).count().reset_index()
# Stock codes that appear with more than one distinct description
dup_codes = cat_des_df.StockCode.value_counts()
dup_codes = dup_codes[dup_codes > 1].reset_index()  # columns: StockCode, count
display(dup_codes.head())
# Show the descriptions attached to one of those duplicated stock codes
cs_df[cs_df['StockCode'] == dup_codes['StockCode'][4]]['Description'].unique()
StockCode count
0 23196 4
1 23236 4
2 23203 3
3 17107D 3
4 23535 3

This gives the multiple descriptions for one of those items, and we witness the simple ways in which data quality can be corrupted in any dataset. A simple spelling mistake can end up reducing data quality and producing an erroneous analysis.

unique_desc = cs_df[["StockCode", "Description"]].groupby(by=["StockCode"]).\
                apply(pd.DataFrame.mode).reset_index(drop=True)
q = '''
select df.InvoiceNo, df.StockCode, un.Description, df.Quantity, df.InvoiceDate,
       df.UnitPrice, df.CustomerID, df.Country
from cs_df as df INNER JOIN 
     unique_desc as un on df.StockCode = un.StockCode
'''

cs_df = pysqldf(q)
cs_df.InvoiceDate = pd.to_datetime(cs_df.InvoiceDate)
cs_df['amount'] = cs_df.Quantity*cs_df.UnitPrice
cs_df.CustomerID = cs_df.CustomerID.astype('Int64')

details = rstr(cs_df)
display(details.sort_values(by='distincts', ascending=False))
Data shape: (397884, 9)
_________
Data types:
 types
object            3
int64             2
float64           2
Int64             1
datetime64[ns]    1
Name: count, dtype: int64
_________
types counts distincts nulls missing ratio uniques skewness kurtosis
InvoiceNo int64 397884 18532 0 0.0 18532 -0.178524 -1.200748
InvoiceDate datetime64[ns] 397884 17282 0 0.0 17282 <NA> <NA>
CustomerID Int64 397884 4338 0 0.0 4338 0.025729 -1.180822
StockCode object 397884 3665 0 0.0 3665 <NA> <NA>
Description object 397884 3647 0 0.0 3647 <NA> <NA>
amount float64 397884 2939 0 0.0 2939 451.443182 232155.117219
UnitPrice float64 397884 440 0 0.0 440 204.032727 58140.396673
Quantity int64 397884 301 0 0.0 301 409.892972 178186.243253
Country object 397884 37 0 0.0 37 <NA> <NA>
fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)
g = cs_df.groupby(["Country"]).amount.sum().sort_values(ascending = False).plot(kind='bar', title='Amount Sales by Country')
cs_df['Internal'] = cs_df.Country.apply(lambda x: 'Yes' if x=='United Kingdom' else 'No' )
f2 = fig.add_subplot(122)
market = cs_df.groupby(["Internal"]).amount.sum().sort_values(ascending = False)
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Internal Market')
plt.show()
_images/db556d406296ed19a70733768a7b494ed866bd644b7637950c5d29e75523fac0.png
fig = plt.figure(figsize=(25, 7))
PercentSales =  np.round((cs_df.groupby(["CustomerID"]).amount.sum().\
                          sort_values(ascending = False)[:51].sum()/cs_df.groupby(["CustomerID"]).\
                          amount.sum().sort_values(ascending = False).sum()) * 100, 2)
g = cs_df.groupby(["CustomerID"]).amount.sum().sort_values(ascending = False)[:51].\
    plot(kind='bar', title='Top Customers: {:3.2f}% Sales Amount'.format(PercentSales))

fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)
PercentSales =  np.round((cs_df.groupby(["CustomerID"]).amount.sum().\
                          sort_values(ascending = False)[:10].sum()/cs_df.groupby(["CustomerID"]).\
                          amount.sum().sort_values(ascending = False).sum()) * 100, 2)
g = cs_df.groupby(["CustomerID"]).amount.sum().sort_values(ascending = False)[:10]\
    .plot(kind='bar', title='Top 10 Customers: {:3.2f}% Sales Amount'.format(PercentSales))
f1 = fig.add_subplot(122)
PercentSales =  np.round((cs_df.groupby(["CustomerID"]).amount.count().\
                          sort_values(ascending = False)[:10].sum()/cs_df.groupby(["CustomerID"]).\
                          amount.count().sort_values(ascending = False).sum()) * 100, 2)
g = cs_df.groupby(["CustomerID"]).amount.count().sort_values(ascending = False)[:10].\
    plot(kind='bar', title='Top 10 Customers: {:3.2f}% Event Sales'.format(PercentSales))
_images/3d95b741d5a7cd17873920f6ce146551ca57ab835b8ff1a689699187fddfceea.png _images/4845f9d0a1caa520401fdbe817abd439e621bcefb325ab66599b5fb67e18945d.png
AmoutSum = cs_df.groupby(["Description"]).amount.sum().sort_values(ascending = False)
inv = cs_df[["Description", "InvoiceNo"]].groupby(["Description"]).InvoiceNo.unique().\
      agg(np.size).sort_values(ascending = False)

fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)
Top10 = list(AmoutSum[:10].index)
PercentSales =  np.round((AmoutSum[Top10].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top10].sum()/inv.sum()) * 100, 2)
g = AmoutSum[Top10].\
    plot(kind='bar', title='Top 10 Products in Sales Amount: {:3.2f}% of Amount and {:3.2f}% of Events'.\
                       format(PercentSales, PercentEvents))

f1 = fig.add_subplot(122)
Top10Ev = list(inv[:10].index)
PercentSales =  np.round((AmoutSum[Top10Ev].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top10Ev].sum()/inv.sum()) * 100, 2)
g = inv[Top10Ev].\
    plot(kind='bar', title='Events of top 10 most sold products: {:3.2f}% of Amount and {:3.2f}% of Events'.\
                       format(PercentSales, PercentEvents))


fig = plt.figure(figsize=(25, 7))
Top15ev = list(inv[:15].index)
PercentSales =  np.round((AmoutSum[Top15ev].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top15ev].sum()/inv.sum()) * 100, 2)
g = AmoutSum[Top15ev].sort_values(ascending = False).\
    plot(kind='bar', 
         title='Sales Amount of top 15 most sold products: {:3.2f}% of Amount and {:3.2f}% of Events'.\
         format(PercentSales, PercentEvents))

fig = plt.figure(figsize=(25, 7))
Top50 = list(AmoutSum[:50].index)
PercentSales =  np.round((AmoutSum[Top50].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top50].sum()/inv.sum()) * 100, 2)
g = AmoutSum[Top50].\
    plot(kind='bar', 
         title='Top 50 Products in Sales Amount: {:3.2f}% of Amount and {:3.2f}% of Events'.\
         format(PercentSales, PercentEvents))

fig = plt.figure(figsize=(25, 7))
Top50Ev = list(inv[:50].index)
PercentSales =  np.round((AmoutSum[Top50Ev].sum()/AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top50Ev].sum()/inv.sum()) * 100, 2)
g = inv[Top50Ev].\
    plot(kind='bar', title='Top 50 most sold products: {:3.2f}% of Amount and {:3.2f}% of Events'.\
                       format(PercentSales, PercentEvents))
_images/5f4a37065b8a63c8b784a780f855ed07ead2c43b89d0bc323aa57bf276ad7693.png _images/3a1644e5d073eafaf7c675c646fdd331fa40d2f902fa9635da082a5d9e3b665b.png _images/5b37e2e02b400989688b0de3d175b9966c1449a68afdc80b7b0becaaa243b1aa.png _images/d9d6722f43125e938aa11e9ff25055c32bd102262635d5ba59b13102a22f10d3.png

Customer Segmentation:#

Customer segmentation is the process of dividing an organization’s customer base into different sections or segments based on various customer attributes. The process is based on the premise of finding differences among the customers’ behavior and patterns.

The major objectives and benefits behind the motivation for customer segmentation are:

  • Higher Revenue: This is the most obvious requirement of any customer segmentation project.

  • Customer Understanding: One of the most widely accepted business paradigms is “know your customer,” and segmentation of the customer base allows for a perfect application of this paradigm.

  • Target Marketing: The most visible reason for customer segmentation is the ability to focus marketing efforts effectively and efficiently. If a firm knows the different segments of its customer base, it can devise better marketing campaigns which are tailor made for the segment. A good segmentation model allows for better understanding of customer requirements and hence increases the chances of the success of any marketing campaign developed by the organization.

  • Optimal Product Placement: A good customer segmentation strategy can also help the firm with developing or offering new products, or a bundle of products together as a combined offering.

  • Finding Latent Customer Segments: A good segmentation can also help the firm discover customer segments it might be missing, identifying untapped groups of customers to target with focused marketing campaigns or new business development.

Clustering:

The most obvious method to perform customer segmentation is using unsupervised Machine Learning methods like clustering. The method is as simple as collecting as much data about the customers as possible in the form of features or attributes and then finding out the different clusters that can be obtained from that data. Finally, we can find traits of customer segments by analyzing the characteristics of the clusters.
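As a minimal sketch of this idea (with invented synthetic numbers, not the Online Retail records), k-means on standardized, RFM-like customer features recovers two planted groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic customer groups in (recency, frequency, monetary) space
low_value  = rng.normal(loc=[200, 2, 100],   scale=[30, 1, 20],  size=(50, 3))
high_value = rng.normal(loc=[20, 30, 5000],  scale=[5, 5, 500],  size=(50, 3))
X = np.vstack([low_value, high_value])

# Standardize features so no single scale dominates the distance metric
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_std)
print(np.bincount(km.labels_))  # two balanced clusters
```

Standardizing first matters because k-means uses Euclidean distance, and an unscaled monetary column would dominate recency and frequency; we apply the same reasoning to the real data later.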

Exploratory Data Analysis:

Using exploratory data analysis is another way of finding out customer segments. This is usually done by analysts who have a good knowledge about the domain relevant to both products and customers. It can be done flexibly to include the top decision points in an analysis.

RFM Model for Customer Value:#

Since our dataset is limited to sales records and doesn’t include other information about our customers, we will use an RFM (Recency, Frequency and Monetary Value) based model of customer value to find our customer segments. The RFM model takes the transactions of a customer and calculates three important informational attributes about each customer:

  • Recency: The value of how recently a customer purchased at the establishment

  • Frequency: How frequent the customer’s transactions are at the establishment

  • Monetary value: The dollar (or pounds in our case) value of all the transactions that the customer made at the establishment
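The three attributes built step by step in the following sections can also be sketched in a single groupby/agg; the DataFrame here is a tiny invented stand-in that only mimics this notebook’s column names:

```python
import pandas as pd

# Toy slice with the same columns the notebook uses
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo":  ["a", "b", "c", "c", "d"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-05", "2011-11-01", "2011-11-01", "2011-12-08"]),
    "amount": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# Reference date: last transaction plus one day
ref_date = df.InvoiceDate.max() + pd.Timedelta(days=1)

rfm = df.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (ref_date - d.max()).days),
    frequency=("InvoiceNo", "nunique"),   # distinct invoices, not line items
    monetary=("amount", "sum"),
)
print(rfm)
```

Note that frequency counts distinct invoices (nunique on InvoiceNo), not line items, mirroring the per-invoice definition used in this notebook.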

Recency#

To create the recency feature variable, we need to decide the reference date for our analysis. Usually, we make use of the last transaction date plus one day. Then, we will construct the recency variable as the number of days before the reference date when a customer last made a purchase.

import datetime
import pandas as pd

# Calculate reference date
reference_date = cs_df.InvoiceDate.max() + pd.Timedelta(days=1)
print('Reference Date:', reference_date)

# Calculate days since last purchase
cs_df['days_since_last_purchase'] = (reference_date - cs_df.InvoiceDate) / pd.Timedelta(1, 'D')

# Calculate customer recency
customer_history_df = cs_df[['CustomerID', 'days_since_last_purchase']].groupby("CustomerID").min().reset_index()
customer_history_df.rename(columns={'days_since_last_purchase': 'recency'}, inplace=True)

# Display summary statistics
customer_history_df.describe().transpose()
Reference Date: 2011-12-10 12:50:00
count mean std min 25% 50% 75% max
CustomerID 4338.0 15300.408022 1721.808492 12346.0 13813.25 15299.5 16778.75 18287.0
recency 4338.0 93.049317 100.013298 1.0 18.072396 51.089931 142.730556 374.122917

We will plot the recency distribution and a QQ-plot to identify substantive departures from normality, such as outliers, skewness and kurtosis.

def QQ_plot(data, measure):
    fig = plt.figure(figsize=(20,7))

    #Get the fitted parameters used by the function
    (mu, sigma) = norm.fit(data)

    #Kernel Density plot
    fig1 = fig.add_subplot(121)
    sns.distplot(data, fit=norm)
    fig1.set_title(measure + ' Distribution ( mu = {:.2f} and sigma = {:.2f} )'.format(mu, sigma), loc='center')
    fig1.set_xlabel(measure)
    fig1.set_ylabel('Frequency')

    #QQ plot
    fig2 = fig.add_subplot(122)
    res = probplot(data, plot=fig2)
    fig2.set_title(measure + ' Probability Plot (skewness: {:.6f} and kurtosis: {:.6f} )'.format(data.skew(), data.kurt()), loc='center')

    plt.tight_layout()
    plt.show()

QQ_plot(customer_history_df.recency, 'Recency')
_images/d89e95d34a8a4df7e9453cab04d191a5ca86a802fca06ab80d82b589bfe0ab76.png

From the first graph above we can see that the recency distribution is skewed: it has a peak on the left and a long tail to the right. It deviates from the normal distribution and is positively skewed.

From the probability plot, we can see that recency also does not align with the diagonal red line, which represents the normal distribution. The shape of the distribution confirms that it is right-skewed.

With a positive skewness of 1.25, we confirm the lack of symmetry: recency is right-skewed, as we can also see in the distribution plot. Right-skewed means that the right tail is long relative to the left tail. The skewness of a normal distribution is zero, and any symmetric data should have skewness near zero; a distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution: data sets with high kurtosis tend to have heavy tails, or outliers, with positive (excess) kurtosis indicating a heavy-tailed distribution and negative kurtosis a light-tailed one. So, with a positive kurtosis of 0.43, recency is somewhat heavy-tailed and has some outliers.
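As a quick sanity check on these definitions (using synthetic data, not the recency series): pandas’ skew() and kurt() compute sample skewness and excess kurtosis, so a draw from a normal distribution scores near zero on both, while an exponential draw is clearly right-skewed and heavy-tailed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=10_000))        # bell-shaped, no skew
right_skewed = pd.Series(rng.exponential(size=10_000))  # long right tail

# pandas reports sample skewness and *excess* kurtosis (normal -> ~0, ~0)
print(round(symmetric.skew(), 2), round(symmetric.kurt(), 2))
print(round(right_skewed.skew(), 2), round(right_skewed.kurt(), 2))
```

The exponential distribution has theoretical skewness 2 and excess kurtosis 6, so its sample statistics land well above zero, matching the interpretation used for recency above.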

Frequency#

customer_freq = (cs_df[['CustomerID', 'InvoiceNo']].groupby(["CustomerID", 'InvoiceNo']).count().reset_index()).\
                groupby(["CustomerID"]).count().reset_index()
customer_freq.rename(columns={'InvoiceNo':'frequency'},inplace=True)
customer_history_df = customer_history_df.merge(customer_freq)
QQ_plot(customer_history_df.frequency, 'Frequency')
_images/c2847df46196e22e236236046ac0a5f8a71364b5bafc06be498442e79ffb3640.png

From the first graph above we can see that the sales frequency distribution is also skewed, with a peak on the left and a long tail to the right. It deviates from the normal distribution and is positively skewed.

From the probability plot, we can see that sales frequency likewise does not align with the diagonal, confirming that the distribution is right-skewed.

A positive skewness of 12.1 confirms the strong lack of symmetry, and a kurtosis of 249 indicates a very heavy-tailed distribution with outliers.

Monetary Value#

customer_monetary_val = cs_df[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
customer_history_df = customer_history_df.merge(customer_monetary_val)
QQ_plot(customer_history_df.amount, 'Amount')

From the first graph above we can see that the sales amount distribution is skewed, with a peak on the left and a long tail to the right. It deviates from the normal distribution and is positively skewed.

From the probability plot, we can see that sales amount also does not align with the diagonal, especially on the right.

A positive skewness of 19.3 confirms the strong lack of symmetry, and a kurtosis of 478 indicates an extremely heavy-tailed distribution with outliers, surely more than 10 of them very extreme.

Let’s see a statistical summary of this dataset:

customer_history_df.describe()
CustomerID recency frequency amount
count 4338.0 4338.000000 4338.000000 4338.000000
mean 15300.408022 93.049317 4.272015 2054.266460
std 1721.808492 100.013298 7.697998 8989.230441
min 12346.0 1.000000 1.000000 3.750000
25% 13813.25 18.072396 1.000000 307.415000
50% 15299.5 51.089931 2.000000 674.485000
75% 16778.75 142.730556 5.000000 1661.740000
max 18287.0 374.122917 209.000000 280206.020000

Data Preprocessing#

Once we have created our customer value dataset, we will perform some preprocessing on the data. For our clustering, we will be using the K-means clustering algorithm. One of the requirements for the proper functioning of the algorithm is mean centering of the variable values. Mean centering of a variable value means that we will replace the actual value of the variable with a standardized value, so that the variable has a mean of 0 and a variance of 1. This ensures that all the variables are in the same range, and that differences in value ranges don’t degrade the algorithm’s performance. This is akin to feature scaling.

Another problem worth investigating is the huge range of values each variable can take. This problem is particularly noticeable for the monetary amount variable. To take care of it, we will transform all the variables to the log scale. This transformation, along with the standardization, will ensure that the input to our algorithm is a homogeneous set of scaled and transformed values.

An important point about the data preprocessing step is that sometimes we need it to be reversible. In our case, we will have the clustering results in terms of the log-transformed and scaled variables. But to make inferences in terms of the original data, we will need to reverse-transform all the variables so that we get back the actual RFM figures. This can be done using the preprocessing capabilities of Python.
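
As a sanity check, the log-plus-scaling pipeline and its reversal can be sketched on toy monetary values (the numbers are illustrative, not from the dataset):

```python
import numpy as np
from sklearn import preprocessing

# Toy amounts spanning a wide range, as the RFM monetary values do.
amounts = np.array([[10.0], [250.0], [4300.0], [77000.0]])

# Forward: log transform, then mean-center / unit-variance scale.
logged = np.log(amounts)
scaler = preprocessing.StandardScaler().fit(logged)
scaled = scaler.transform(logged)

# Reverse: undo the scaling, then undo the log.
recovered = np.exp(scaler.inverse_transform(scaled))
print(np.allclose(recovered, amounts))  # True: the preprocessing round-trips
```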

customer_history_df['recency_log'] = customer_history_df['recency'].apply(math.log)
customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
customer_history_df['amount_log'] = customer_history_df['amount'].apply(math.log)
feature_vector = ['amount_log', 'recency_log','frequency_log']
X_subset = customer_history_df[feature_vector]
scaler = preprocessing.StandardScaler().fit(X_subset)
X_scaled = scaler.transform(X_subset)
pd.DataFrame(X_scaled, columns=X_subset.columns).describe().T
count mean std min 25% 50% 75% max
amount_log 4338.0 -7.010426e-16 1.000115 -4.179280 -0.684183 -0.060942 0.654244 4.721395
recency_log 4338.0 9.827700e-18 1.000115 -2.747518 -0.655196 0.096033 0.838707 1.535305
frequency_log 4338.0 -9.991495e-17 1.000115 -1.048610 -1.048610 -0.279044 0.738267 4.882714
fig = plt.figure(figsize=(20,14))
f1 = fig.add_subplot(221); sns.regplot(x='recency', y='amount', data=customer_history_df)
f1 = fig.add_subplot(222); sns.regplot(x='frequency', y='amount', data=customer_history_df)
f1 = fig.add_subplot(223); sns.regplot(x='recency_log', y='amount_log', data=customer_history_df)
f1 = fig.add_subplot(224); sns.regplot(x='frequency_log', y='amount_log', data=customer_history_df)

fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')

xs =customer_history_df.recency_log
ys = customer_history_df.frequency_log
zs = customer_history_df.amount_log
ax.scatter(xs, ys, zs, s=5)

ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')

plt.show()
_images/542660f5ef5f0ad64e30fb7262702eb3c6bb9a8752d200bcacb2bafbca4366d3.png _images/3e5aa493430f902ddca2c541faac81c1389576149d881eb0adf38add48a0f3e1.png

The obvious pattern we can see in the plots above is that customers who buy more frequently and more recently tend to spend more: the monetary (amount) value trends upward as frequency increases and recency decreases.

Clustering for Segments#

K-Means Clustering#

K-means clustering belongs to the partition-based (centroid-based) family of hard clustering algorithms, in which each sample in a dataset is assigned to exactly one cluster.

Based on the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia. So, the objective of K-means clustering is to minimize the total intra-cluster variance, or squared error function: SSE = Σ_{i=1..k} Σ_{x ∈ C_i} ‖x − μ_i‖², where μ_i is the centroid of cluster C_i.

The steps that happen in the K-means algorithm for partitioning the data are as given follows:

  1. The algorithm starts with random point initializations of the required number of centers. The “K” in K-means stands for the number of clusters.

  2. In the next step, each data point is assigned to the center closest to it. The distance metric used in K-means clustering is the standard Euclidean distance.

  3. Once the data points are assigned, the centers are recalculated by averaging the dimensions of the points belonging to the cluster.

  4. The process is repeated with new centers until we reach a point where the assignments become stable. In this case, the algorithm terminates.
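
The four steps above can be sketched from scratch with NumPy (an illustrative toy implementation, not a substitute for scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initialization -- pick k data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Step 4: stop once the assignments (and hence the centers) are stable.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```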

K-means++#
  • Place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than the classic k-means.

  • To use k-means++ with scikit-learn’s KMeans object, we just need to set the init parameter to k-means++ (the default setting) instead of random.

The Elbow Method#

Using the elbow method, we can find the optimal number of clusters. The idea behind the elbow method is to identify the value of k at which the distortion stops decreasing rapidly. As k increases, the distortion decreases, because the samples will be closer to the centroids they are assigned to.

This method looks at the percentage of variance explained as a function of the number of clusters. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within-group variance.

cl = 50       # maximum number of clusters to try
corte = 0.1   # cutoff: fix K_best once the relative gain in inertia falls below 10%

anterior = 100000000000000  # inertia of the previous k
cost = []
K_best = cl

for k in range(1, cl+1):
    # Create a k-means model on our data, using k clusters. random_state helps ensure that the algorithm returns the same results each time.
    model = KMeans(
        n_clusters=k,
        init='k-means++',
        n_init=10,
        max_iter=300,
        tol=1e-04,
        random_state=101)

    model = model.fit(X_scaled)

    # These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
    labels = model.labels_

    # Sum of squared distances of samples to their closest cluster center
    inertia = model.inertia_
    if (K_best == cl) and (((anterior - inertia)/anterior) < corte): K_best = k - 1
    cost.append(inertia)
    anterior = inertia

plt.figure(figsize=(8, 6))
plt.scatter(range (1, cl+1), cost, c='red')
plt.show()

# Create a kmeans model with the best K.
print('The best K suggested: ', K_best)
model = KMeans(n_clusters=K_best, init='k-means++', n_init=10,max_iter=300, tol=1e-04, random_state=101)

# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(X_scaled)

# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = model.labels_

# And we'll visualize it:
#plt.scatter(X_scaled[:,0], X_scaled[:,1], c=model.labels_.astype(float))
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
plt.scatter(x = X_scaled[:,1], y = X_scaled[:,0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[1])
ax.set_ylabel(feature_vector[0])
ax = fig.add_subplot(122)
plt.scatter(x = X_scaled[:,2], y = X_scaled[:,0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[2])
ax.set_ylabel(feature_vector[0])

plt.show()
_images/40d71666a3b82dfc7ed3a964f24d968b75c4f02e7598aa73ff5541cc76e3cf01.png
The best K suggested:  7
_images/64016dfc4fd0f17deb9ee8ca762658788cbee2860d4b1fa0d7cbe6c5fcd18147.png

Note that, by the elbow method, from K = 3 onward we already observe low rates of gain in the decay of the distortion, with the relative gain reaching the 10% limit at K = 7. With this in mind, we will evaluate the options 3, 5 and 7 more deeply, starting with the silhouette analysis.

Silhouette analysis on K-Means clustering#

Silhouette analysis can be used to study the separation distance between the resulting clusters, as a strategy for quantifying the quality of clustering via a graphical tool that plots a measure of how tightly grouped the samples in each cluster are. The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters, and thus provides a way to visually assess parameters like the number of clusters.

It can also be applied to clustering algorithms other than k-means.

Silhouette coefficients have a range of [-1, 1] and are calculated as follows:

  1. Calculate the cluster cohesion a(i) as the average distance between a sample x(i) and all other points in the same cluster.

  2. Calculate the cluster separation b(i) from the next closest cluster as the average distance between the sample x(i) and all samples in the nearest cluster.

  3. Calculate the silhouette s(i) as the difference between cluster cohesion and separation divided by the greater of the two: s(i) = (b(i) − a(i)) / max(a(i), b(i)), which can also be written as: s(i) = 1 − a(i)/b(i) if a(i) < b(i); 0 if a(i) = b(i); b(i)/a(i) − 1 if a(i) > b(i).

Where:

  • A value near +1 indicates that the sample is far away from the neighboring clusters.

  • A high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

  • If most objects have a high value, then the clustering configuration is appropriate.

  • If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

  • A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters.

  • Negative values indicate that those samples might have been assigned to the wrong cluster.
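
The a(i)/b(i) definition above can be verified by hand against scikit-learn on a tiny example (a sketch that assumes exactly two clusters, so the nearest other cluster is simply the other one):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated clusters in one dimension.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_sample(i):
    same = labels == labels[i]
    d = np.abs(X - X[i]).ravel()           # Euclidean distance in 1-D
    a = d[same].sum() / (same.sum() - 1)   # cohesion: mean distance within own cluster
    b = d[~same].mean()                    # separation: mean distance to the other cluster
    return (b - a) / max(a, b)

manual = np.mean([silhouette_sample(i) for i in range(len(X))])
print(np.isclose(manual, silhouette_score(X, labels)))  # True
```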

The silhouette plot can reveal a bad choice of K for the given data through the presence of clusters with below-average silhouette scores, and also through wide fluctuations in the size of the silhouette plots. A good K is one for which all the plots are more or less of similar thickness and hence of similar sizes.

We have to keep in mind, though, that in some cases and scenarios we may have to set aside the mathematical explanation given by the algorithm and look at the business relevance of the results obtained.

Let’s see below how our data perform for each number of clusters (3, 5 and 7): the silhouette score of each cluster, along with the center of each discovered cluster, in the scatter plots of amount_log vs. recency_log and vs. frequency_log.

cluster_centers = dict()

for n_clusters in range(3,K_best+1,2):
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
    fig.set_size_inches(25, 7)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])

    clusterer = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10,max_iter=300, tol=1e-04, random_state=101)
    cluster_labels = clusterer.fit_predict(X_scaled)

    silhouette_avg = silhouette_score(X = X_scaled, labels = cluster_labels)
    cluster_centers.update({n_clusters :{'cluster_center':clusterer.cluster_centers_,
                                         'silhouette_score':silhouette_avg,
                                         'labels':cluster_labels}
                           })

    sample_silhouette_values = silhouette_samples(X = X_scaled, labels = cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.Spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
    colors = cm.Spectral(cluster_labels.astype(float) / n_clusters)
    
    centers = clusterer.cluster_centers_
    y = 0
    x = 1
    ax2.scatter(X_scaled[:, x], X_scaled[:, y], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')   
    ax2.scatter(centers[:, x], centers[:, y], marker='o', c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[x], c[y], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
    ax2.set_title("{} Clustered data".format(n_clusters))
    ax2.set_xlabel(feature_vector[x])
    ax2.set_ylabel(feature_vector[y])

    x = 2
    ax3.scatter(X_scaled[:, x], X_scaled[:, y], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')   
    ax3.scatter(centers[:, x], centers[:, y], marker='o', c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax3.scatter(c[x], c[y], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
    ax3.set_title("Silhouette score: {:1.2f}".format(cluster_centers[n_clusters]['silhouette_score']))
    ax3.set_xlabel(feature_vector[x])
    ax3.set_ylabel(feature_vector[y])
    
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
_images/206145a427eb1df3170f259ad8ef9b256828bc41c1022bb88e52b3197e3203c2.png _images/42af9ad7e800da0fbc01172cce086e521afcfc94450344c0020e9ab757d0107f.png _images/45a93532823d9593ac999d782feaee31517b473cfe73473198cb0eaf389665b0.png

When we look at the results of the clustering process, we can infer some interesting insights:

  • First, notice that all the K options are valid, because none of them has clusters with below-average silhouette scores.

  • On the other hand, all the options show some wide fluctuations in the size of the silhouette plots.

So, the best choice may lie in the option that gives us a simpler business explanation and, at the same time, targets customer focus groups with sizes closer to the desired ones.

Clusters Center:#

Let’s look at the cluster center values after returning them to normal values from the log and scaled version.

features = ['amount',  'recency',  'frequency']
for i in range(3,K_best+1,2):
    print("for {} clusters the silhouette score is {:1.2f}".format(i, cluster_centers[i]['silhouette_score']))
    print("Centers of each cluster:")
    cent_transformed = scaler.inverse_transform(cluster_centers[i]['cluster_center'])
    print(pd.DataFrame(np.exp(cent_transformed),columns=features))
    print('-'*50)
for 3 clusters the silhouette score is 0.33
Centers of each cluster:
        amount     recency  frequency
0   955.108413   35.220128   3.042955
1  3859.014223    8.763853   9.651473
2   259.270172  119.903139   1.183191
--------------------------------------------------
for 5 clusters the silhouette score is 0.31
Centers of each cluster:
        amount     recency  frequency
0  1161.573046   83.524872   3.155363
1  5952.818406    8.576574  13.683600
2   296.081689   31.515878   1.295759
3  1314.406149   11.107092   4.290844
4   231.257842  212.118311   1.144017
--------------------------------------------------
for 7 clusters the silhouette score is 0.31
Centers of each cluster:
         amount     recency  frequency
0   2134.498750    5.204186   6.460064
1    205.931613  226.183423   1.085344
2    240.188728   37.193024   1.130437
3    815.752704  108.048616   2.280582
4    666.191186   14.347347   2.667995
5   2408.833819   38.804916   5.996435
6  10200.920787    5.633035  20.695211
--------------------------------------------------

Clusters Insights:#

With the plots and the center in the correct units, let’s see some insights by each clusters groups:

In the three-cluster:

  • The three clusters appear to show stark differences in the monetary value of the customers; we will confirm this with a box plot.

  • Cluster 1 is the cluster of high-value customers who shop frequently, certainly an important segment for any business.

  • In a similar way, we obtain customer groups with low and medium spend in the clusters labeled 2 and 0, respectively.

  • Frequency and recency track the monetary value closely, following the trend high monetary, low recency, high frequency.

In the five-cluster:

  • Note that clusters 0 and 1 are very similar to their counterparts in the configuration with only 3 clusters.

  • Cluster 1 appears even more clearly as the group of those who shop often and spend large amounts.

  • Cluster 3 contains those who have a decent spend but are not as frequent as cluster 1.

  • Cluster 0 purchases medium amounts, with a relatively low frequency, and not very recently.

  • Cluster 2 makes low-cost purchases, with a relatively low frequency (though above 1), and made its last purchase more recently. This group of customers probably responds to price discounts and can be targeted with loyalty promotions to try to increase the average ticket, a strategy that can be better defined when we analyze the market basket.

  • The silhouette score says that the five-cluster segmentation is slightly less optimal than the three-cluster one.

In the seven-cluster:

  • Cluster 6 definitely defines those who shop often and spend large amounts.

  • Clusters 0 and 5 show good spending and good frequency, differing only in how recent their last purchases were (5 is older), which suggests an active push to sell to group 5 as soon as possible, and another action for 0 aimed at raising its frequency.

  • Cluster 3 presents the fourth-best purchase value and a reasonable frequency, but has gone a long time without buying. This group should be receptive to promotions and activations, so that these customers are not lost and make their next purchase.

  • Cluster 4 is similar to 3, but has made its purchases more recently and has a slightly better periodicity. Actions should be taken to raise its frequency and reduce the chances of these customers migrating to cluster 3 by staying too long without purchasing products.

Drill Down Clusters:#

To drill down further on this point and examine the quality of these differences, we can label our data with the corresponding cluster label and then visualize the differences. The following code extracts the clustering labels and attaches them to our customer summary dataset.

customer_history_df['clusters_3'] = cluster_centers[3]['labels'] 
customer_history_df['clusters_5'] = cluster_centers[5]['labels']
customer_history_df['clusters_7'] = cluster_centers[7]['labels']
display(customer_history_df.head())

fig = plt.figure(figsize=(20,7))
f1 = fig.add_subplot(131)
market = customer_history_df.clusters_3.value_counts()
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('3 Clusters')
f1 = fig.add_subplot(132)
market = customer_history_df.clusters_5.value_counts()
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('5 Clusters')
f1 = fig.add_subplot(133)
market = customer_history_df.clusters_7.value_counts()
g = plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('7 Clusters')
plt.show()
CustomerID recency frequency amount recency_log frequency_log amount_log clusters_3 clusters_5 clusters_7
0 12346 326.117361 1 77183.60 5.787257 0.000000 11.253942 0 0 5
1 12347 2.873611 7 4310.00 1.055569 1.945910 8.368693 1 1 0
2 12348 75.984028 4 1797.24 4.330523 1.386294 7.494007 0 0 5
3 12349 19.124306 1 1757.55 2.950960 0.000000 7.471676 0 2 4
4 12350 310.867361 1 334.40 5.739366 0.000000 5.812338 2 4 1
_images/d0e47326a4335ecff4bf364afb0142c93d533e3b6f34a5178197dad89acdb97d.png

Once we have the labels assigned to each of the customers, our task is simple: we want to find out how the customer summary varies across groups. If we can visualize that information, we will be able to find the differences between the customer clusters and can adjust our strategy on the basis of those differences.

The following code leverages plotly and will take the cluster labels we got for each cluster configuration and create boxplots. Plotly enables us to interact with the plots to see the central tendency values of each boxplot in the notebook. Note that we want to avoid the extremely high outlier values of each group, as they interfere with making good observations around the central tendencies of each cluster. Since we have only positive values, we will restrict the data to points below the 95th percentile of the cluster. This gives us good information about the majority of the users in that cluster segment.

I’ve used these charts to review the insights stated earlier; feel free to use them for your own exploration:

x_data = ['Cluster 0', 'Cluster 1','Cluster 2','Cluster 3','Cluster 4', 'Cluster 5', 'Cluster 6']
colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)', 
          'rgba(22, 80, 57, 0.5)', 'rgba(127, 65, 14, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
cutoff_quantile = 95

for n_clusters in range(3,K_best+1,2):
    cl = 'clusters_' + str(n_clusters)
    for fild in range(0, 3):
        field_to_plot = features[fild]        
        y_data = list()
        ymax = 0
        for i in np.arange(0,n_clusters):
            y0 = customer_history_df[customer_history_df[cl]==i][field_to_plot].values
            y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
            if ymax < max(y0): ymax = max(y0)
            y_data.insert(i, y0)

        traces = []

        for xd, yd, cls in zip(x_data[:n_clusters], y_data, colors[:n_clusters]):
                traces.append(go.Box(y=yd, name=xd, boxpoints=False, jitter=0.5, whiskerwidth=0.2, fillcolor=cls,
                    marker=dict( size=1, ),
                    line=dict(width=1),
                ))

        layout = go.Layout(
            title='Difference in {} with {} Clusters and {:1.2f} Score'.\
            format(field_to_plot, n_clusters, cluster_centers[n_clusters]['silhouette_score']),
            yaxis=dict( autorange=True, showgrid=True, zeroline=True,
                dtick = int(ymax/10),
                gridcolor='black', gridwidth=0.1, zerolinecolor='rgb(255, 255, 255)', zerolinewidth=2, ),
            margin=dict(l=40, r=30, b=50, t=50, ),
            paper_bgcolor='white',
            plot_bgcolor='white',
            showlegend=False
        )

        fig = go.Figure(data=traces, layout=layout)
        py.offline.iplot(fig)

Next Steps in the Segmentation:#

To enhance discovery and further improve the quality of the clustering, other customer information and purchase details may be included in this dataset as additional features.

For example, but not limited to:

  • New indicators, such as customer relationship length, based on the date of the customer’s first purchase

  • Whether the customer is from abroad or not

  • Some group or category of products, obtained through the SKUs

  • External data from vendors, and so on.

Another dimension to explore is trying out different algorithms for performing the segmentation, for instance hierarchical clustering, which we explored in some of the earlier chapters. A good segmentation process will encompass all these avenues to arrive at optimal segments that provide valuable insight.
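
For instance, a hierarchical (agglomerative) variant can be sketched with SciPy; the toy blobs below stand in for the scaled RFM features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for scaled RFM features: two well-separated groups in 3-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.2, (10, 3)), rng.normal(4, 0.2, (10, 3))])

# Build the merge tree bottom-up with Ward linkage, then cut it into 2 flat clusters.
Z = linkage(X, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
```

Unlike K-means, the full merge tree Z can be cut at any number of clusters without refitting.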

Cross Selling#

Cross selling is the ability to sell more products to a customer by analyzing the customer’s shopping trends, as well as general shopping trends and patterns in common with the customer’s. More often than not, these recommended products are very appealing: the retailer will often offer a bundle of products with an attractive offer, and it is highly likely that the customer will end up buying the bundled products instead of just the original item.

So, we research the customer transactions, find potential additions to the customer’s original needs, and offer them to the customer as suggestions, in the hope and intent that they buy them, benefiting both the customer and the retail establishment.

In this section, we explore association rule-mining, a powerful technique that can be used for cross selling, then we apply the concept of market basket analysis to our retail transactions dataset.

Market Basket Analysis with Association Rule-Mining#

image The whole concept of association rule-mining is based on the concept that customer purchase behavior has a pattern which can be exploited for selling more items to the customer in the future.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. This rule-based approach also generates new rules as it analyzes more data. The ultimate goal, assuming a large enough dataset, is to help a machine mimic the human brain’s feature extraction and abstract association capabilities from new uncategorized data.

An association rule usually has a structure like X -> Y, where both the antecedent X and the consequent Y are itemsets. The rule can be read in the obvious manner: when the customer buys the items on the left of the rule, they are likely to also buy the item on the right. The following are some vital concepts pertaining to association rule-mining.

  • Itemset: A collection of one or more items that occur together in a transaction. For example, {milk, bread} is an itemset.

  • Support: The number of times an itemset appears in the dataset, expressed as a proportion. The support of X with respect to a transaction set T is defined as the proportion of transactions t in T which contain the itemset X. Mathematically: supp(X) = |{t ∈ T : X ⊆ t}| / |T|

  • Confidence: An indication of how often the rule has been found to be true, i.e., how often the rule holds among the transactions containing its antecedent. For a rule { X -> Y }, such as { beer -> diaper }, the confidence is defined as: conf(X -> Y) = supp(X ∪ Y) / supp(X)

  • Lift: The lift of a rule is defined as the ratio of the observed support to the support expected if the elements of the rule were independent. For the previous set of transactions, if the rule is { X -> Y }, then the lift is defined as: lift(X -> Y) = supp(X ∪ Y) / (supp(X) · supp(Y))

    • If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.

    • If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.

    • If the lift is < 1, that lets us know the items are substitutes for each other. This means that the presence of one item has a negative effect on the presence of the other, and vice versa.

  • Frequent itemset: Itemsets whose support is greater than a user-defined support threshold.

  • Conviction: The ratio of the expected frequency with which X occurs without Y (that is, the frequency with which the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. The conviction of a rule is defined as: conv(X -> Y) = (1 − supp(Y)) / (1 − conf(X -> Y))
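
All four measures can be computed by hand on a toy transaction set (a hand-rolled sketch for the rule { beer -> diaper }; no library involved):

```python
# Five toy transactions; the rule under test is { beer -> diaper }.
transactions = [
    {'beer', 'diaper'}, {'beer', 'diaper', 'milk'},
    {'beer'}, {'diaper'}, {'milk'},
]
n = len(transactions)

def support(itemset):
    # Proportion of transactions that contain the whole itemset.
    return sum(itemset <= t for t in transactions) / n

sup_x = support({'beer'})              # 3/5
sup_y = support({'diaper'})            # 3/5
sup_xy = support({'beer', 'diaper'})   # 2/5

confidence = sup_xy / sup_x                  # P(diaper | beer)
lift = sup_xy / (sup_x * sup_y)              # observed vs. independent support
conviction = (1 - sup_y) / (1 - confidence)  # expected vs. observed error rate
```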

Algorithms:#

Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done after to generate rules from frequent itemsets found in a database.

The major bottleneck in any association rule-mining algorithm is the generation of frequent itemsets. If the transaction dataset has k unique products, then potentially we have 2^k possible itemsets.

Apriori#

Apriori uses a breadth-first search strategy to count the support of itemsets, and uses a candidate generation function that exploits the downward closure property of support. So, the algorithm will first generate candidate itemsets and then proceed to find the frequent ones. Even for around 100 unique products the number of possible itemsets is huge, which can make the Apriori algorithm prohibitively expensive computationally.
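
The level-by-level candidate generation can be sketched in a few lines of plain Python (a naive version that only illustrates the downward-closure idea; real implementations add aggressive pruning):

```python
def apriori(transactions, min_support):
    n = len(transactions)
    # Level 1: frequent single items.
    freq = {frozenset([i]) for t in transactions for i in t
            if sum(i in u for u in transactions) / n >= min_support}
    result, k = set(freq), 2
    while freq:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets; by downward
        # closure, every subset of a frequent itemset must itself be frequent.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        result |= freq
        k += 1
    return result

transactions = [{'milk', 'bread'}, {'milk', 'bread', 'butter'}, {'bread'}, {'milk'}]
frequent = apriori(transactions, min_support=0.5)
```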

Eclat algorithm#

Eclat is a depth-first search algorithm based on set intersection. It is suitable for both sequential as well as parallel execution with locality-enhancing properties.

FP Growth#

FP stands for frequent pattern. The FP growth algorithm is superior to Apriori algorithm as it doesn’t need to generate all the candidate itemsets. The FP growth algorithm uses a divide-and-conquer strategy and leverages a special data structure called the FP-tree, to find frequent itemsets without generating all itemsets. The core steps of the algorithm are as follows:

  1. In the first pass, the algorithm takes in the transaction database, counts the occurrences of items (attribute-value pairs) in the dataset, and stores them in a ‘header table’.

  2. In the second pass, it builds the FP-tree structure by inserting instances to represent frequent itemsets. Items in each instance have to be sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly. Items in each instance that do not meet the minimum coverage threshold are discarded. If many instances share their most frequent items, the FP-tree provides high compression close to the tree root.

  3. Divide this compressed representation into multiple conditional datasets such that each one is associated with a frequent pattern.

  4. Mine for patterns in each such dataset so that shorter patterns can be recursively concatenated to longer patterns, hence making it more efficient.

Recursive processing of this compressed version of the main dataset grows large itemsets directly, instead of generating candidate items and testing them against the entire database. Growth starts from the bottom of the header table (having the longest branches), by finding all instances matching a given condition. A new tree is created, with counts projected from the original tree corresponding to the set of instances that are conditional on the attribute, with each node getting the sum of its children's counts. Recursive growth ends when no individual items conditional on the attribute meet the minimum support threshold, and processing continues on the remaining header items of the original FP-tree.

Once the recursive process has completed, all large item sets with minimum coverage have been found, and association rule creation begins.
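The first pass of FP-Growth described in step 1 can be sketched on hypothetical toy transactions: count item frequencies, drop items below the minimum support count, and reorder each transaction by descending frequency before it would be inserted into the tree:

```python
from collections import Counter

# Toy transactions and minimum support count, for illustration only.
transactions = [['bread', 'milk', 'beer'],
                ['bread', 'milk'],
                ['milk', 'beer'],
                ['bread', 'jam']]
min_count = 2

# Pass 1: build the header table of frequent items, most frequent first.
counts = Counter(item for t in transactions for item in t)
header = {i: c for i, c in counts.most_common() if c >= min_count}

# Re-sort each transaction by descending frequency (ties broken by name),
# dropping infrequent items, so shared prefixes compress near the root.
ordered = [sorted((i for i in t if i in header),
                  key=lambda i: (-header[i], i))
           for t in transactions]
print(header)
print(ordered)
```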

Build Transaction Dataset#

In order to feed our data to these algorithms, we must first turn it into a sales-event table where each product sold is represented by a column, whose value is 1 when the product was sold in that event and 0 when it was not. This generates a sparse table.

# One column per unique product, one row per invoice.
items = list(cs_df.Description.unique())
grouped = cs_df.groupby('InvoiceNo')
transaction_level = grouped.aggregate(lambda x: tuple(x)).reset_index()[['InvoiceNo','Description']]

temp = dict()
for rec in transaction_level.to_dict('records'):
    invoice_num = rec['InvoiceNo']
    items_list = rec['Description']
    # Binary indicator vector: 1 if the product appears in this invoice, else 0.
    transaction_dict = {item:0 for item in items}
    transaction_dict.update({item:1 for item in items if item in items_list})
    temp.update({invoice_num:transaction_dict})

new = [v for k,v in temp.items()]
transaction_df = pd.DataFrame(new)

Prune Dataset for frequently purchased items#

We saw in the earlier EDA how only a handful of items are responsible for the bulk of our sales, so we want to prune our dataset to reflect this information. For this we have created the function prune_dataset below, which will help us reduce the size of our dataset based on our requirements. The function can be used for performing three types of pruning:

  • Pruning based on percentage of total sales: The parameter total_sales_perc will help us select the number of items that will explain the required percentage of sales. The default value is 50% or 0.5.

  • Pruning based on ranks of items: Another way to perform the pruning is to specify the starting and the ending rank of the items for which we want to prune our dataset.

  • Pruning based on list of features passed to the parameter TopCols.

By default, we will only look for transactions which have at least two items, as transactions with only one item are counter to the whole concept of association rule-mining.

def prune_dataset(input_df, length_trans = 2, total_sales_perc = 0.5, 
                  start_item = None, end_item = None, TopCols = None):
    if 'total_items' in input_df.columns:
        del(input_df['total_items'])
    item_count = input_df.sum().sort_values(ascending = False).reset_index()
    total_items = sum(input_df.sum().sort_values(ascending = False))
    item_count.rename(columns={item_count.columns[0]:'item_name',
                               item_count.columns[1]:'item_count'}, inplace=True)
    if TopCols: 
        input_df['total_items'] = input_df[TopCols].sum(axis = 1)
        input_df = input_df[input_df.total_items >= length_trans]
        del(input_df['total_items'])
        return input_df[TopCols], item_count[item_count.item_name.isin(TopCols)]
    elif start_item is not None and end_item is not None and end_item > start_item:
        selected_items = list(item_count[start_item:end_item].item_name)
        input_df['total_items'] = input_df[selected_items].sum(axis = 1)
        input_df = input_df[input_df.total_items >= length_trans]
        del(input_df['total_items'])
        return input_df[selected_items],item_count[start_item:end_item]
    else:
        item_count['item_perc'] = item_count['item_count']/total_items
        item_count['total_perc'] = item_count.item_perc.cumsum()
        selected_items = list(item_count[item_count.total_perc < total_sales_perc].item_name)
        input_df['total_items'] = input_df[selected_items].sum(axis = 1)
        input_df = input_df[input_df.total_items >= length_trans]
        del(input_df['total_items'])
        return input_df[selected_items], item_count[item_count.total_perc < total_sales_perc]

We use the second pruning option, selecting the top 15 products by number of sales events.

output_df, item_counts = prune_dataset(input_df=transaction_df, length_trans=2,start_item = 0, end_item = 15)
print('Total of Sales Amount by the Top 15 Products in Sales Events (Invoice): {:.2f}'.format(AmoutSum[Top15ev].sum()))
print('Number of Sales Events:', output_df.shape[0])
print('Number of Products:', output_df.shape[1])

item_counts
Total of Sales Amount by the Top 15 Products in Sales Events (Invoice): 778377.21
Number of Sales Events: 4664
Number of Products: 15
item_name item_count
0 WHITE HANGING HEART T-LIGHT HOLDER 1978
1 REGENCY CAKESTAND 3 TIER 1703
2 JUMBO BAG RED RETROSPOT 1600
3 PARTY BUNTING 1379
4 ASSORTED COLOUR BIRD ORNAMENT 1375
5 LUNCH BAG RED RETROSPOT 1289
6 SET OF 3 CAKE TINS PANTRY DESIGN 1146
7 POSTAGE 1099
8 JUMBO BAG VINTAGE DOILY 1080
9 LUNCH BAG BLACK SKULL. 1052
10 LUNCH BAG SUKI DESIGN 1043
11 POPCORN HOLDER 1035
12 PACK OF 72 RETROSPOT CAKE CASES 1029
13 SPOTTY BUNTING 1009
14 LUNCH BAG VINTAGE DOILY 1006

So we find that these 15 items are responsible for 8.73% of the sales amount, and close to 5% of the sales events result in 4,664 transactions that have those items along with other items. The next step is to convert this selected data into the required table data structure.

Association Rule Mining with FP Growth#

Orange Table Data Structure#

Since we are using the Orange framework we still have to convert our data to the Table data structure by providing metadata about our columns. We need to define the domain for each of our variables, that is, the set of possible values each variable can take. This information will be stored as metadata and will be used in later transformations of the data. As our columns only hold binary values, we can easily create the domain from this information.

# Define the data domain by declaring each variable as a DiscreteVariable with values (0, 1)
domain_transac = Domain([DiscreteVariable.make(name=item, values=['0', '1']) \
                         for item in input_assoc_rules.columns])

# Build the Table data structure using the defined domain
data_tran = Orange.data.Table.from_numpy(domain=domain_transac,  
                                         X=input_assoc_rules.to_numpy(), Y=None)

# Encode the entire domain as binary variables
data_tran_en, mapping = OneHot.encode(data_tran, include_class=True)
Creating our rules#

We need to specify two pieces of information for generating our rules: support and confidence. An important consideration is to start with a higher support, as a lower support will mean a higher number of frequent itemsets and hence a longer execution time. We will specify a min support of 0.01 and see the number of frequent itemsets that we get before we specify confidence and generate our rules.

support = 0.01
print("num of required transactions = ", int(input_assoc_rules.shape[0]*support))
num_trans = input_assoc_rules.shape[0]*support
itemsets = dict(frequent_itemsets(data_tran_en, support))
print('Items Set Size:', len(itemsets))
num of required transactions =  46
Items Set Size: 663273

So we get a whopping 663,273 itemsets for only 15 items and a support of only 1%! This number will increase exponentially if we decrease the support or if we increase the number of items in our dataset. The next step is specifying a confidence value and generating our rules. The following code snippet will perform rule generation and decoding of rules, and then compile it all in a neat dataframe that we can use for further analysis.

confidence = 0.6
rules_df = pd.DataFrame()
if len(itemsets) < 1000000: 
    rules = [(P, Q, supp, conf)
             for P, Q, supp, conf in association_rules(itemsets, confidence)
             if len(Q) == 1]

    names = {item: '{}={}'.format(var.name, val)
        for item, var, val in OneHot.decode(mapping, data_tran, mapping)}
    
    eligible_ante = [v for k,v in names.items() if v.endswith("1")]
    
    N = input_assoc_rules.shape[0]
    
    rule_stats = list(rules_stats(rules, itemsets, N))
    
    rule_list_df = []
    for ex_rule_frm_rule_stat in rule_stats:
        ante = ex_rule_frm_rule_stat[0]            
        cons = ex_rule_frm_rule_stat[1]
        named_cons = names[next(iter(cons))]
        if named_cons in eligible_ante:
            rule_lhs = [names[i][:-2] for i in ante if names[i] in eligible_ante]
            ante_rule = ', '.join(rule_lhs)
            if ante_rule and len(rule_lhs)>1 :
                rule_dict = {'support' : ex_rule_frm_rule_stat[2],
                             'confidence' : ex_rule_frm_rule_stat[3],
                             'coverage' : ex_rule_frm_rule_stat[4],
                             'strength' : ex_rule_frm_rule_stat[5],
                             'lift' : ex_rule_frm_rule_stat[6],
                             'leverage' : ex_rule_frm_rule_stat[7],
                             'antecedent': ante_rule,
                             'consequent':named_cons[:-2] }
                rule_list_df.append(rule_dict)
    rules_df = pd.DataFrame(rule_list_df)
    print("Raw rules data frame of {} rules generated".format(rules_df.shape[0]))
    if not rules_df.empty:
        pruned_rules_df = rules_df.groupby(['antecedent','consequent']).max().reset_index()
    else:
        print("Unable to generate any rule")
Raw rules data frame of 25247 rules generated
Explore the Association Rules Created#

Let’s look at the 5 rules with the highest confidence:

dw = pd.options.display.max_colwidth
pd.options.display.max_colwidth = 100
(rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
 groupby(['consequent', 'antecedent']).first()
                                      .reset_index()
                                      .sort_values(['confidence', 'support', 'lift'],
                                                   ascending=False)).head()
consequent antecedent support confidence lift
20 JUMBO BAG VINTAGE DOILY JUMBO BAG RED RETROSPOT, LUNCH BAG RED RETROSPOT, LUNCH BAG VINTAGE DOILY 60 0.909091 4.520256
133 LUNCH BAG VINTAGE DOILY LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG BLACK SKULL., LUNCH BAG SUKI DESIGN 49 0.890909 4.596460
134 LUNCH BAG VINTAGE DOILY LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG SUKI DESIGN 48 0.888889 4.586037
131 LUNCH BAG VINTAGE DOILY LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG BLACK SKULL. 48 0.872727 4.502655
21 JUMBO BAG VINTAGE DOILY JUMBO BAG RED RETROSPOT, LUNCH BAG SUKI DESIGN , LUNCH BAG VINTAGE DOILY 48 0.872727 4.339446

Now, the top 5 rules by support:

(rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
 groupby(['consequent', 'antecedent']).first()
                                      .reset_index()
                                      .sort_values(['support', 'confidence', 'lift'],
                                                   ascending=False)).head()
consequent antecedent support confidence lift
123 LUNCH BAG VINTAGE DOILY JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT 154 0.777778 4.012783
28 JUMBO BAG VINTAGE DOILY LUNCH BAG VINTAGE DOILY , JUMBO BAG RED RETROSPOT 153 0.805263 4.003995
92 LUNCH BAG RED RETROSPOT LUNCH BAG VINTAGE DOILY , LUNCH BAG SUKI DESIGN 149 0.668161 2.672646
77 LUNCH BAG RED RETROSPOT LUNCH BAG BLACK SKULL., LUNCH BAG SUKI DESIGN 142 0.601695 2.406780
4 JUMBO BAG RED RETROSPOT JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT 138 0.696970 2.688723

Typically, a lift value of 1 indicates that the occurrences of the antecedent and the consequent are independent of each other. Hence, the idea is to look for rules having a lift much greater than 1. So, let’s see how many rules have a lift greater than 1, equal to 1, and less than 1:

rules_df.lift.apply(lambda x: 'Greater Than One' if x > 1 else 'One' \
                           if x == 1 else 'Less Than One').value_counts()
lift
Greater Than One    25247
Name: count, dtype: int64

So all rules have a lift greater than one, and all are therefore valid.
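The lift criterion used above can be sketched with a tiny helper. This is illustrative only (the numbers below are hypothetical, not taken from the rules dataframe):

```python
# lift(X -> Y) = conf(X -> Y) / supp(Y): how much X's presence raises
# the chance of Y relative to Y's baseline popularity.
def lift(conf_xy, supp_y):
    return conf_xy / supp_y

# If Y appears in 20% of all transactions but in 80% of transactions
# containing X, the rule is well above independence:
print(lift(0.8, 0.2))  # 4.0
```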

pd.options.display.max_colwidth = dw

Conclusion#

As we have seen, the generation of rules is a conceptually simple process, but a very computationally expensive one, since the search space grows exponentially with the number of items.

Overall, we seek the proper balance between support and confidence leading to a reasonable number of strong rules.

On the other hand, if the goal is to identify rare but high-confidence patterns, we should proceed as we did previously, by establishing a low support threshold and a higher confidence threshold, which leads to a large number of rules.

With this in mind, the rules with low support and high confidence would then be our target for further study, and for outlining strategies to increase cross-selling.